Bioinformatic workflows: G-PIPE as an implementation

Authors

  • Alexander Garcia
  • Samuel Thoraval
  • Leyla J. Garcia
  • Yi-Ping Phoebe Chen
  • Mark A. Ragan
Abstract

We present G-PIPE, a graphic pipeline generator for PISE that allows the definition of pipelines, parameterization of their component methods, and storage of metadata in XML formats. Our implementation goes beyond the macro capacities currently in PISE. As the entire analysis protocol is defined in XML, a complete bioinformatic experiment (linked sets of methods, parameters and results) can be reproduced or shared among users. We also discuss the role of ontologies as guidance systems that allow users to define abstract workflows and execute them. A relevant baseline ontology is presented.

Availability: http://if-web.imb.uq.edu.au

INTRODUCTION

Computational methods of problem solving need to interleave information access and algorithm execution in a problem-specific workflow. In complex domains like molecular biosciences, workflows usually involve iterative steps of querying, analysis and optimisation. Bioinformatic experiments are often workflows; they link analysis methods that typically accept an input file, compute a result, and present an output file. Query workflows are sometimes implemented over relational database management systems (Wong 2000; Haas et al. 2001) and in such cases can be built using SQL statements. Analysis workflows, on the other hand, provide a path to discover information beyond the capacities of simple query statements, but are much harder to implement within a common environment.

Systems such as W2H (Ernst et al. 2003) and PISE (Letondal 2001) provide some tools that allow methods to be combined. W3H (Carver and Mullan 2002) is a task framework that allows integration of methods available under W2H. In the case of PISE, the user can either define a macro using BioPerl (www.bioperl.org), or use the interface provided and register the resulting macro. Macros cannot be exchanged between PISE and W2H, although both provide GUIs for more or less the same set of methods (EMBOSS: Rice et al. 2000). Indeed, macros cannot be shared even among PISE users.

G-PIPE provides a real capacity for users to define and share complete experiments (methods, parameters, and meta-information), substantially mitigating the syntactic complexity that this process involves. We have tested G-PIPE by defining different pipelines and exchanging results between different PISE servers. One of these pipelines was PATH (Del Val et al. 2002).

SYNTACTIC COMPONENTS AND WORKFLOW TERMINOLOGY

The workflow language presented here closely follows the concepts presented by Lei and Singh (1997) and Stevens et al. (2001). We have adapted these meta-models to bioinformatic analysis processes. We present key definitions below; a more extensive presentation of these terms and concepts is in preparation:

  • Input data object: a collection of input data.
  • Transformer: the atomic work item in a workflow. In analysis workflows, it is an implementation of an analysis algorithm (analysis method).
  • Pipe component: the entity that contains the required input-output relation (e.g. information about the previous and subsequent tasks); it assures syntactic coherence.
  • Output object: the result of the transformation applied to one or more input data objects.
  • Task: a defined piece of work. One analysis method, together with its parameters and input data object(s), constitutes a task.
  • Stage: an instance within a workflow, containing tasks, annotation, and output. A stage can contain more than one transformer.
  • Workflow: a group of stages with interdependencies. It is a process bound to a particular resource that fulfills the process.
  • Parameters: experimental conditions relevant to a particular transformer.
  • Annotation: meta-information relevant to the experiment. Annotation is the main source of knowledge by which other researchers can understand and reproduce the experiment.
  • Protocol: a set of information that describes an experiment. A protocol contains workflows, annotations, and information about the raw data.
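
As an illustration of the terminology above, the following is a minimal Python sketch of how tasks, stages, and a workflow might be represented and written to XML so that an experiment can be shared and reloaded. It is not G-PIPE's actual schema or code: the class layout, the element and attribute names (`workflow`, `stage`, `task`, `parameter`, `pipe`), and the method and file names in the example are all hypothetical and chosen only for this sketch.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class Task:
    """One analysis method with its parameters and input data object(s)."""
    method: str                                  # name of the transformer
    parameters: dict = field(default_factory=dict)
    inputs: list = field(default_factory=list)


@dataclass
class Stage:
    """An instance within a workflow: tasks, annotation, and output."""
    name: str
    tasks: list = field(default_factory=list)
    annotation: str = ""
    output: str = ""


@dataclass
class Workflow:
    """A group of stages with interdependencies."""
    stages: list = field(default_factory=list)
    # Pipe information: stage name -> names of the stages that follow it.
    pipes: dict = field(default_factory=dict)


def workflow_to_xml(wf):
    """Serialize a workflow to an XML string (hypothetical element names)."""
    root = ET.Element("workflow")
    for stage in wf.stages:
        s = ET.SubElement(root, "stage", name=stage.name)
        if stage.annotation:
            ET.SubElement(s, "annotation").text = stage.annotation
        for task in stage.tasks:
            t = ET.SubElement(s, "task", method=task.method)
            for inp in task.inputs:
                ET.SubElement(t, "input").text = inp
            for key, value in task.parameters.items():
                ET.SubElement(t, "parameter", name=key).text = str(value)
        if stage.output:
            ET.SubElement(s, "output").text = stage.output
    for src, targets in wf.pipes.items():
        for dst in targets:
            ET.SubElement(root, "pipe", attrib={"from": src, "to": dst})
    return ET.tostring(root, encoding="unicode")


def workflow_from_xml(text):
    """Rebuild a workflow from XML produced by workflow_to_xml."""
    root = ET.fromstring(text)
    wf = Workflow()
    for s in root.findall("stage"):
        stage = Stage(
            name=s.get("name"),
            annotation=s.findtext("annotation", default=""),
            output=s.findtext("output", default=""),
        )
        for t in s.findall("task"):
            stage.tasks.append(Task(
                method=t.get("method"),
                parameters={p.get("name"): p.text or ""
                            for p in t.findall("parameter")},
                inputs=[i.text or "" for i in t.findall("input")],
            ))
        wf.stages.append(stage)
    for p in root.findall("pipe"):
        wf.pipes.setdefault(p.get("from"), []).append(p.get("to"))
    return wf


# Example: a two-stage workflow with made-up method and file names.
wf = Workflow(
    stages=[
        Stage(name="alignment",
              tasks=[Task("align", {"gap_penalty": "10"}, ["seqs.fasta"])],
              annotation="Align input sequences",
              output="aln.out"),
        Stage(name="tree",
              tasks=[Task("build_tree", {}, ["aln.out"])]),
    ],
    pipes={"alignment": ["tree"]},
)
xml_text = workflow_to_xml(wf)
restored = workflow_from_xml(xml_text)
```

In this sketch parameter values are stored as strings, so a workflow written to XML and read back compares equal to the original; a real system would need typed parameters and validation against the method descriptions.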


Related resources

Guided Composition of Tasks with Logical Information Systems - Application to Data Analysis Workflows in Bioinformatics

In a number of domains, particularly in bioinformatics, there is a need for complex data analysis. To that end, elementary data analysis operations called tasks are composed as workflows. The composition of tasks is however difficult due to the distributed and heterogeneous resources of bioinformatics. This doctoral work will address the composition of tasks using Logical Information System...


Cellarium: A Computational Biology Workflowing Environment

Cellarium is a Computational Biology Workflowing Environment. It is designed to simplify the creation and refinement of workflows. There is a large variety of disparate bioinformatic components: data sources, algorithms, visualizations, etc. Cellarium provides a means of leveraging any such component, providing greater interoperability of components than ad-hoc approaches. Cellarium also prov...


Microbiome Helper: a Custom and Streamlined Workflow for Microbiome Research

Sequence-based approaches to study microbiomes, such as 16S rRNA gene sequencing and metagenomics, are uncovering associations between microbial taxa and a myriad of factors. A drawback of these approaches is that the necessary sequencing library preparation and bioinformatic analyses are complicated and continuously changing, which can be a barrier for researchers new to the field. We present ...


Fasta-O-Matic: a tool to sanity check and if needed reformat FASTA files

Background: As the sheer volume of bioinformatic sequence data increases, the only way to take advantage of this content is to more completely automate robust analysis workflows. Analysis bottlenecks are often mundane and overlooked processing steps. Idiosyncrasies in reading and/or writing bioinformatics file formats can halt or impair analysis workflows by interfering with the transfer of dat...


Integrative workflows for metagenomic analysis

The rapid evolution of all sequencing technologies, described by the term Next Generation Sequencing (NGS), has revolutionized metagenomic analysis. They constitute a combination of high-throughput analytical protocols, coupled to delicate measuring techniques, in order to potentially discover, properly assemble and map allelic sequences to the correct genomes, achieving particularly high yiel...




Publication date: 2005